CS 498 AML Homework 5

This is a html/pdf render of the Rmd (Rstudio markdown file) which is also attached with correspondingly used CSV files.

Question 1 - Music

music <- read_csv('./music.csv')
music.num_features = dim(music)[2]-2
music["latitude"] <- music["latitude"] + 90
music["longitude"] <- music["longitude"] + 180

A small angle transformation was required here to make the \(latitude\) and \(longitude\) values non-negative. Box-Cox requires the response variables to be non-negative.

Latitude

(a) Fit a linear model

lat.lm <- lm(latitude ~ . - longitude, data = music)
## [1] "The R^2 value is"  "0.290022599470525"

The \(R^2\) value tells us how well the regression explains the training data. In this case our \(R^2\) estimation is rather low (approx. \(0.29\)), explicitely telling us that we should consider a better model or perhaps a transformation. Indeed, The Box-Cox transformation is a method that can search for a power transformation of the dependent variable that improves the regression. One searches for a value of \(\lambda\) that makes residuals look most like a normal distribution.

(b) Box-Cox

lat.bc_info = boxcox(lat.lm, lambda=seq(-1,5))

lat.bc_lambda = lat.bc_info$x[which.max(lat.bc_info$y)] #selecting best lambda
lat.lm_lambda = lm( latitude^lat.bc_lambda ~ . - longitude, data=music)
# now let us see if we have improved our model and Rsquared value

Above, different powers (\(\lambda\)) are teseted on the response variable and the one with the highest log-likelihood is chosen.

## [1] "The new R^2 value is" "0.319712643517031"
## [1] "The Box-Cox optimal lambda is" "3.54545454545455"

Moreover, we have a new \(R^2\) value that has increased to approx. \(0.32\) which clearly demonstrates an improvement from the original \(0.29\). Hence we will be using the Box-Cox model for the rest of the excercise. This is a significant enough increase in accuracy to justify using a more complex model. We now proced to regularization, and more specifically Ridge and Lasso regularizations, we want to see if we can still improve things at this point.

Looking at the residuals vs fitted graph (transformed back to the orignal power), there appears to be a pattern. We expect the residuals to be normally distirbuted, this should make us question out assumption of homoscedasticity.

(c) Regularization

# Lasso L1
lat.l1 = cv.glmnet(as.matrix(music[1:music.num_features]),
                as.matrix(music["latitude"]^lat.bc_lambda),
                alpha=1)
# Ridge L2
lat.l2 = cv.glmnet(as.matrix(music[1:music.num_features]),
                   as.matrix(music["latitude"]^lat.bc_lambda),
                   alpha=0)
## [1] "The minimum log lambda for Lasso L1 is"
## [2] "11.299423869787"
## [1] "The minimum log lambda for Ridge L2 is"
## [2] "14.2997620212641"

plot(log(lat.l1$lambda),
     (lat.l1$cvm)^(1/lat.bc_lambda),
     pch=19,
     col="red",
     xlab="log(Lambda)",
     ylab="CVM",
     main="MSE as function of log(Lambda)",
     sub = toString(c("Lasso L1", "Power scale", lat.bc_lambda)))
points(log(lat.l2$lambda),
       (lat.l2$cvm)^(1/lat.bc_lambda),
       pch=19,
       col="blue")
legend("bottomright", legend=c("alpha=1 L1","alpha=0 L2"), pch=10, col=c("red", "blue"))

Longitude

(a) Fit a linear model

lon.lm <- lm(longitude ~ . - latitude, data = music)
## [1] "The R^2 value is"  "0.335536695339324"

(b) Box-Cox

lon.bc_info = boxcox(lon.lm)

lon.bc_lambda = lon.bc_info$x[which.max(lon.bc_info$y)]
lon.lm_lambda = lm(longitude^lon.bc_lambda ~ . - latitude, data=music)
## [1] "The new R^2 value is" "0.335673855400167"
## [1] "The Box-Cox optimal lambda is" "1.03030303030303"

The Box-Cox transformation does not help because the residuals are not really effected. Since the optimal \(\lambda\) found by Box-Cox is about \(1\), we don’t expect this to do a lot. Since \(\lambda\) is about \(1\), we will use the simpler model.

(c) Regularization

# Lasso L1
lon.l1 = cv.glmnet(as.matrix(music[1:music.num_features]),
                as.matrix(music["longitude"]),
                alpha=1)
# Ridge L2
lon.l2 = cv.glmnet(as.matrix(music[1:music.num_features]),
                as.matrix(music["longitude"]),
                alpha=0)
## [1] "The minimum log(lambda) for Lasso L1 is"
## [2] "-1.04410224116321"
## [1] "the minimum log(lambda) for Ridge L2 is"
## [2] "1.39803346352744"

plot(log(lon.l1$lambda),lon.l1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=lon.l1$name, xlim = c(-4,10), ylim=c(1900, 3700))
points(log(lon.l2$lambda),lon.l2$cvm,pch=19,col="grey")
legend("topleft",legend=c("alpha= 1 L1","alpha= 0 L2"),pch=19,col=c("red","grey"))

It is clear from our plots that a regularization can help in certain cases. From the above we observe that in case 1 where we consider lattitude, the mean squared error is relatively reduced in both L1 and L2. In this case, both methods help in a very similar way – they both regularize to very similar MSE values – however it is worth mentioning that L1 always allows for a reduced amount of attributes to be considered as lamda increases, hence it already has a small advantage. The clear difference in determining whether L1 is better than L2 is rather seen in the second case, where we now consider longitude. The above plot clearly shows that L1 allows for a smaller MSE compared to L2 for the given minimum lambdas. We can thus conclude that Lasso might be a slight better choice.

From our research we can add to these results the following very useful comment found on stackOverflow:

“Keep in mind that ridge regression can’t zero out coefficients; thus, you either end up including all the coefficients in the model, or none of them. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net instead of the LASSO.” - http://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge/876#876

Question 2 - Default

Here we actually need to get accuracy results from glmnet

default <- read_csv('./default.csv')
default.num_features = dim(default)[2]-1

(a) Fit a linear model

# AUC (area under curve) Plots 
default.lm_cv.lasso <- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "auc",
                        alpha=1)
default.lm_cv.ridge <- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "auc",
                        alpha=0)
default.lm_cv.elastic<- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "auc",
                        alpha=0.5)

plot(default.lm_cv.lasso, sub="Lasso")

plot(default.lm_cv.ridge, sub="Ridge")

plot(default.lm_cv.elastic,sub="Elasticnet")

get_deviance <- function(model) {
  best_lambda_i <- which(model$glmnet.fit$lambda == model$lambda.min)
  deviance <- model$glmnet.fit$dev.ratio[best_lambda_i]
  return(deviance)
}

# Binomial Deviance Plots
default.lm_cv.lasso <- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "deviance",
                        alpha=1)

print(c("Best lambda's deviance for Lasso: ", get_deviance(default.lm_cv.lasso)))
## [1] "Best lambda's deviance for Lasso: " "0.120538220700719"
default.lm_cv.ridge <- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "deviance",
                        alpha=0)

print(c("Best lambda's deviance for Ridge: ", get_deviance(default.lm_cv.ridge)))
## [1] "Best lambda's deviance for Ridge: " "0.119432088916329"
default.lm_cv.elastic<- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "deviance",
                        alpha=0.5)

print(c("Best lambda's deviance for Elastic: ", get_deviance(default.lm_cv.elastic)))
## [1] "Best lambda's deviance for Elastic: "
## [2] "0.120602274681122"
plot(default.lm_cv.lasso, sub="Lasso")

plot(default.lm_cv.ridge, sub="Ridge")

plot(default.lm_cv.elastic,sub="Elasticnet")

#MSE - Mean Square Error Plots

default.lm_cv.MSE.lasso <- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "class",
                        alpha=1)

default.lm_cv.MSE.ridge <- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "class",
                        alpha=0)

default.lm_cv.MSE.elastic<- cv.glmnet(as.matrix(default[1:default.num_features]),
                        as.matrix(default["default"]),
                        family="binomial",
                        type.measure = "class",
                        alpha=0.5)

plot(default.lm_cv.MSE.lasso, sub="Lasso")

plot(default.lm_cv.MSE.ridge, sub="Ridge")

plot(default.lm_cv.MSE.elastic,sub="Elasticnet")

Above, we have reported plots for all three different methods we have explored: Lasso, ridge and Elasticnet(with alpha = 0.5). There are in total three different types of plots: AUC with respect to log(lambda) (area under curve), Binomial Deviance with respect to log(lambda) and, finally, MSE (mean square error) with respect to log(lambda).

As mentioned by Tanmay on piazza: " for logistic regression you can get 3 error metrics that can be plotted against loglambda - binomial deviance, AUC, misclassification error."

Essentially, the goal here was to explore these three different methods and simply report the most ideal regularization. In each case the cross validated glmnet allows us to extract the new accuracy of our model on test data.

From the plots it seems more obvious that Lasso gives us the best regularization model. This was mainly apparent from the binomial deviance and Misclassification error plots.

1.Ridge:

  • deviance(for best regularization): 0.9318225

  • MSE: ~0.194

2.Lasso:

  • deviance(for best regularization): 0.9311568

  • MSE: ~0.189 (depending on cv) can be bigger or smaller than Elasticnet

3.Elasticnet:

  • deviance(for best regularization): 0.9341511

  • MSE:~0.189 (depending on cv) can be bigger or smaller than Lasso

[1] “Best lambda’s deviance for Lasso:” “0.120565993933999”
[1] “Best lambda’s deviance for Ridge:” “0.119432088916329”
[1] “Best lambda’s deviance for Elastic:” “0.120620761231308”

Hence, as predicted visually before, our results are confirmed. Lasso seems to have the best regularization in this case and for this particular iteration. We could also report the accuracy with respect to some test data, the results would be indeed very similar and cv.glmnet allows us to avoid this additional step by having an initial prediction values reported above.

Question 3 - Cancer

Here we use a wide dataset, meaning we have more attributes than actual data. From the class notes it is clearly mentioned that a wiser approach is to use lasso and if not even an elasticnet. As mentioned earlier, Lasso will allow use to ignore certain independant variables who result in zeros after our regularization, thing that is not allowed in the ridge process.

cancer <- read_csv('./cancer.csv', col_names = FALSE)
cancer.num_features = dim(cancer)[2]-1
colnames(cancer)[2001] <- 'state'

# AUC
cancer.lm_cv.lasso.auc <- cv.glmnet(model.matrix(~ ., data=cancer[1:cancer.num_features]),
                        cancer$state,
                        family="binomial",
                        type.measure = "auc",
                        alpha=1, nfolds=5) #not enough data for more than 6 folds
plot(cancer.lm_cv.lasso.auc)

# Deviance
cancer.lm_cv.lasso.deviance <- cv.glmnet(model.matrix(~ ., data=cancer[1:cancer.num_features]),
                        cancer$state,
                        family="binomial",
                        type.measure = "deviance",
                        alpha=1)

plot(cancer.lm_cv.lasso.deviance)

Both measures use about the same \(\lambda\) Regarding the number of genes, we realized that the deviance measure reported a slightly higher number than AUC when we ran it multiple times. AUC uses about 10 genes, but deviance uses about 15-20 genes.